Project - Applied Statistics

by HARI SAMYNAATH S

Part ONE

1. Question: Refer to the table below to answer the following questions:

Planned   Purchased   NotPurchased   Total
Yes       400         100            500
No        200         1300           1500
Total     600         1400           2000
  1. Refer to the above table and find the joint probability of the people who planned to purchase and actually placed an order.
  2. Refer to the above table and find the conditional probability that people actually placed an order, given that they planned to purchase.
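
The two probabilities can be read straight off the table. The following is a minimal sketch of that arithmetic; the variable names are illustrative and not taken from the original notebook.

```python
# Counts taken from the contingency table above.
planned_and_purchased = 400   # planned "Yes" and actually purchased
planned_total = 500           # everyone who planned to purchase
grand_total = 2000            # everyone surveyed

# Joint probability: planned AND purchased, out of everyone surveyed.
p_joint = planned_and_purchased / grand_total

# Conditional probability: purchased GIVEN that the person planned to purchase.
p_conditional = planned_and_purchased / planned_total

print(f"P(planned and purchased) = {p_joint:.2f}")
print(f"P(purchased | planned)   = {p_conditional:.2f}")
```

The only difference between the two results is the denominator: the whole sample for the joint probability, and only the "planned" row for the conditional one.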

2. Question: An electrical manufacturing company conducts quality checks at specified periods on the products it manufactures. Historically, the failure rate for the manufactured item is 5%. Suppose a random sample of 10 manufactured items is selected. Answer the following questions.

A. Probability that none of the items is defective?

B. Probability that exactly one of the items is defective?

C. Probability that two or fewer of the items are defective?

also given by

bin_pmf[:3].sum()

D. Probability that three or more of the items are defective?

also given by

bin_pmf[3:].sum()
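
The code cells for this question are not part of this text, so the following is a minimal sketch of how the four answers could be obtained, assuming a binomial model with n = 10 and p = 0.05. It uses only the standard library; `bin_pmf` is built here as a plain list, so the built-in `sum` plays the role of the `.sum()` calls quoted above.

```python
from math import comb

n, p = 10, 0.05  # sample size and historical failure rate from the question

# Binomial probability mass for k = 0..n defective items.
bin_pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

p_none = bin_pmf[0]                  # A. no defective items
p_one = bin_pmf[1]                   # B. exactly one defective item
p_two_or_fewer = sum(bin_pmf[:3])    # C. k = 0, 1 or 2
p_three_or_more = sum(bin_pmf[3:])   # D. k = 3..10, equals 1 - C

for label, value in [("A", p_none), ("B", p_one),
                     ("C", p_two_or_fewer), ("D", p_three_or_more)]:
    print(f"{label}: {value:.4f}")
```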

3. Question: A car salesman sells on an average 3 cars per week.

A. Probability that in a given week he will sell some cars.

B. Probability that in a given week he will sell 2 or more but fewer than 5 cars.

C. Plot the Poisson distribution function for cumulative probability of cars sold per week vs number of cars sold per week.
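
A minimal sketch of the three parts, assuming weekly sales follow a Poisson distribution with mean 3. The helper `poisson_pmf` and the cut-off of 10 cars for the cumulative curve are illustrative choices, not taken from the original notebook; the `ks` and `cdf` lists can be passed to any plotting library to draw part C.

```python
from math import exp, factorial

lam = 3  # average number of cars sold per week

def poisson_pmf(k, lam):
    """Probability of exactly k events for a Poisson rate lam."""
    return exp(-lam) * lam**k / factorial(k)

# A. "Some cars" means at least one car.
p_some = 1 - poisson_pmf(0, lam)

# B. 2 or more but fewer than 5 cars means k = 2, 3 or 4.
p_2_to_4 = sum(poisson_pmf(k, lam) for k in range(2, 5))

# C. Cumulative probability for k = 0..10, ready for plotting.
ks = list(range(11))
cdf = []
running_total = 0.0
for k in ks:
    running_total += poisson_pmf(k, lam)
    cdf.append(running_total)

print(f"A: {p_some:.4f}")
print(f"B: {p_2_to_4:.4f}")
for k, c in zip(ks, cdf):
    print(f"P(X <= {k}) = {c:.4f}")
```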

4. Question: Accuracy in understanding orders is important for Company X, which has designed, marketed and launched a speech-based bot for contactless delivery at a restaurant due to the COVID-19 pandemic. Recognition accuracy, which measures the percentage of orders that are taken correctly, is 86.8%. Suppose that you place an order with the bot and two friends of yours independently place orders with the same bot. Answer the following questions.

A. What is the probability that all three orders will be recognised correctly?

B. What is the probability that none of the three orders will be recognised correctly?

C. What is the probability that at least two of the three orders will be recognised correctly?
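
Because the three orders are independent with the same accuracy, the number of correctly recognised orders can be treated as binomial with n = 3 and p = 0.868. The sketch below works under that assumption; the variable names are illustrative.

```python
from math import comb

n, p = 3, 0.868  # three independent orders, recognition accuracy from the question

def binom_pmf(k):
    """Probability that exactly k of the n orders are recognised correctly."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_all_three = binom_pmf(3)                      # A. all three correct
p_none = binom_pmf(0)                           # B. none correct
p_at_least_two = binom_pmf(2) + binom_pmf(3)    # C. two or three correct

print(f"A: {p_all_three:.4f}")
print(f"B: {p_none:.4f}")
print(f"C: {p_at_least_two:.4f}")
```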

5. Question: A group of 300 professionals sat for a competitive exam. The results show the information of marks obtained by them have a mean of 60 and a standard deviation of 12. The pattern of marks follows a normal distribution. Answer the following questions.

A. What is the percentage of students who score more than 80?

B. What is the percentage of students who score less than 50?

C. What should be the distinction mark if the highest 10% of students are to be awarded distinction?
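
A minimal sketch for the three parts using `statistics.NormalDist` from the standard library, assuming marks are exactly normal with mean 60 and standard deviation 12. The variable names are illustrative.

```python
from statistics import NormalDist

marks = NormalDist(mu=60, sigma=12)

# A. Percentage scoring more than 80: upper tail beyond 80.
pct_above_80 = (1 - marks.cdf(80)) * 100

# B. Percentage scoring less than 50: lower tail below 50.
pct_below_50 = marks.cdf(50) * 100

# C. Distinction cut-off for the top 10%: the 90th percentile.
distinction_mark = marks.inv_cdf(0.90)

print(f"A: {pct_above_80:.2f}%")
print(f"B: {pct_below_50:.2f}%")
print(f"C: {distinction_mark:.2f}")
```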

6. Question: Explain 1 real life industry scenario [other than the ones mentioned above] where you can use the concepts learnt in this module of Applied statistics to get a data driven business solution.

Consider a use case in which part supply and demand must be balanced to keep shortages under control, and we have to explore the probability that a particular vendor (say x) causes a part shortage n times in a given period. (This would call for the Poisson probability function.)
Such a study would help risk management and strategic planning.

======================================================================================================


Part TWO

CONTEXT: Company X manages the men's top professional basketball division of the American league system. The dataset contains information on all the teams that have participated in all the past tournaments. It has data about how many baskets each team scored, conceded, how many times they came within the first 2 positions, how many tournaments they have qualified, their best position in the past, etc.

OBJECTIVE: Company’s management wants to invest in proposals for managing some of the best teams in the league. The analytics department has been assigned the task of creating a report on the performance shown by the teams. Some of the older teams are already in contract with competitors. Hence Company X wants to understand which teams they can approach which will be a deal win for them.

Steps and tasks:

  1. Read the data set, clean the data and prepare a final dataset to be used for analysis.

There are 61 rows and 13 columns

The columns are given by

Sl    Column Name            Description
1.    Team                   Team’s name
2.    Tournament             Number of played tournaments.
3.    Score                  Team’s score so far.
4.    PlayedGames            Games played by the team so far.
5.    WonGames               Games won by the team so far.
6.    DrawnGames             Games drawn by the team so far.
7.    LostGames              Games lost by the team so far.
8.    BasketScored           Basket scored by the team so far.
9.    BasketGiven            Basket scored against the team so far.
10.   TournamentChampion     How many times the team was a champion of the tournaments so far.
11.   Runner-up              How many times the team was a runners-up of the tournaments so far.
12.   TeamLaunch             Year the team was launched on professional basketball.
13.   HighestPositionHeld    Highest position held by the team amongst all the tournaments played.

Many columns are wrongly represented as the object datatype.
There appear to be no nulls.

Type casting raises an error where non-numerical values are present:
A. The major invalid value is found to be "-"
B. The TeamLaunch column has mixed entries, which have to be processed further
C. The last entry, at index 60, has the most "-" values

As noted above, there are several "-" values in the row at index 60.
Let's drop that row, as the team has never played a game, leaving us with no clues about its performance or capabilities.

Let's impute the remaining "-" values in the "TournamentChampion" and "Runner-up" columns.

The TeamLaunch column contains the year of launch, either as a Gregorian calendar year like YYYY or probably as a financial/academic year range like YYYY-YY.
Hence let's extract the launch year in the Gregorian format of YYYY alone.
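
One way to extract the launch year is to take the first four-digit run in each entry. The sketch below is dependency-free and uses a hypothetical helper `launch_year` with made-up sample strings; in a pandas workflow the same regular expression could be applied to the column with `Series.str.extract`.

```python
import re

def launch_year(raw):
    """Return the first four-digit year found in a TeamLaunch entry."""
    match = re.search(r"\d{4}", str(raw))
    if match is None:
        raise ValueError(f"no four-digit year in {raw!r}")
    return int(match.group())

# Hypothetical entries in the two formats described above.
samples = ["1929", "1934-35"]
years = [launch_year(s) for s in samples]
print(years)
```

For a range such as YYYY-YY this keeps the starting year, which is an assumption about what "launch year" should mean for those entries.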

Now let's convert all the columns (except Team) into integer types.

All columns converted without error
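
The imputation and type-casting steps can be sketched as follows. This is a dependency-free illustration on hypothetical rows, and it assumes that a "-" in a count column means the event never happened, so it is filled with 0; the text above does not state which fill value was actually used.

```python
# Hypothetical rows as they might look after reading the raw file as text.
rows = [
    {"Team": "Team A", "Score": "4385", "TournamentChampion": "33", "Runner-up": "23"},
    {"Team": "Team B", "Score": "1814", "TournamentChampion": "-", "Runner-up": "-"},
]

def clean_row(row, placeholder="-", fill=0):
    """Replace the placeholder with `fill` (assumed 0) and cast counts to int."""
    cleaned = {"Team": row["Team"]}  # Team stays a string
    for column, value in row.items():
        if column == "Team":
            continue
        cleaned[column] = fill if value == placeholder else int(value)
    return cleaned

clean_rows = [clean_row(row) for row in rows]
for row in clean_rows:
    print(row)
```

Casting with `int` deliberately fails on any value other than the placeholder, which mirrors the way the type-casting errors exposed the invalid entries in the first place.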

Rearrange the dataset and set the team name as the index,
then check the info and description.

All invalid data points are cleaned and the dataset is ready for analysis.

  1. Perform detailed statistical analysis and EDA using univariate, bivariate and multivariate EDA techniques to get data-driven insights on recommending which teams they can approach which will be a deal win for them. Also, as a data and statistics expert, you have to develop a detailed performance report using this data.

The attributes of the data vary in scale, ranging from tens to thousands.

Though it is hard to visualise the distribution of individual attributes due to the large differences in scale, we are able to find significant outliers in 3 columns.
Those outliers in Score, WonGames and BasketScored cannot be excluded from the data points, as those exceptional performances define the top teams.

The attributes don't follow a normal distribution, probably because teams of various generations are being compared here (TeamLaunch ranges over 60 years).
Let's study the attribute-wise distributions to get a better picture.

Interactive Univariate Analysis

With most of the attributes being right-skewed and none following a normal distribution,
it will be difficult to determine the better performing teams.

Hence let us explore further with bivariate analysis.

Quite a lot of attributes are found to be related, either positively or inversely.
Let us review the correlation coefficients to measure these relationships.
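
For reference, the Pearson correlation coefficient between two attributes can be computed as below. This is a minimal sketch on hypothetical numbers, assuming Pearson's r is the coefficient meant here; the helper name `pearson_r` and the sample values are illustrative only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance_sum = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance_sum / (spread_x * spread_y)

# Hypothetical values for two attributes of four teams.
played_games = [100, 200, 300, 400]
won_games = [60, 90, 150, 260]

r = pearson_r(played_games, won_games)
print(f"r = {r:.4f}")
```

Values close to +1 or -1 indicate a strong positive or inverse linear relationship, and values near 0 indicate little linear relationship.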

Feature Selection & Engineering

Let us study the relationships with the refined attributes of the set of teams.

Now, the refined attributes define the group of teams more accurately

Having arrived at meaningful qualities of the group of teams,
one may intuitively choose teams with a high WinRatio to invest in.
So let's see if that is a worthy investment.
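
The definition of WinRatio is not shown in this text. A minimal sketch, assuming it is WonGames divided by PlayedGames, with hypothetical teams and values:

```python
# Hypothetical teams; only the column names come from the dataset description.
teams = [
    {"Team": "Team A", "PlayedGames": 100, "WonGames": 60},
    {"Team": "Team B", "PlayedGames": 80, "WonGames": 20},
    {"Team": "Team C", "PlayedGames": 50, "WonGames": 35},
]

# Engineered feature (assumed definition): share of played games that were won.
for team in teams:
    team["WinRatio"] = team["WonGames"] / team["PlayedGames"]

# Rank teams from the highest to the lowest WinRatio.
ranked = sorted(teams, key=lambda t: t["WinRatio"], reverse=True)
for team in ranked:
    print(f'{team["Team"]}: {team["WinRatio"]:.2f}')
```

A ratio like this removes the scale effect of older teams having simply played more games, which is why it is a more comparable quality than raw WonGames.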

Interestingly, yes: the teams with high WinRatios have been TournamentChampions several times (Teams 1 to 5).
But those are the oldest teams in the group, and are expected to be already in contract with competitors.

So who are we left with? Only the young teams!
Surprisingly, among teams not older than 25 years, there are 2 budding performers with high perseverance:
Teams 21 & 25 have shown high interest in playing frequently.

Compare teams of choice

(against teams of age<=25)

Fact Summary

We recommend that Company X invest in Teams 21 & 25, the two most promising performers among the younger teams.

  1. Please include any improvements or suggestions to the association management on quality, quantity, variety, velocity, veracity etc. on the data points collected by the association to perform a better data analysis in future.

To the association management:

======================================================================================================


Part THREE

CONTEXT: Company X is an EU online publisher focusing on the startup industry. The company specifically reports on the business related to technology news, analysis of emerging trends and profiling of new tech businesses and products. Their event, Startup Battlefield, is the world’s pre-eminent startup competition. Startup Battlefield features 15-30 top early-stage startups pitching to top judges in front of a vast live audience, present in person and online.

OBJECTIVE: Analyse the data of the various companies from the given dataset and perform the tasks that are specified in the below steps. Draw insights from the various attributes that are present in the dataset, plot distributions, state hypotheses and draw conclusions from the dataset.

Steps and tasks:

  1. Data warehouse:
    • Read the CSV file.

ATTRIBUTE INFORMATION

Each row in the dataset is a Start-up company and the columns describe the company

Column           Description
Startup          Name of the company
Product          Actual product
Funding          Funds raised by the company in USD
Event            The event the company participated in
Result           Described by Contestant, Finalist, Audience choice, Winner or Runner up
OperatingState   Current status of the company: Operating, Closed, Acquired or IPO
  1. Data exploration:
    • Check the datatypes of each attribute.

All the attributes are found to be of object datatype. Going forward, the Funding column must be considered for appropriate conversion.
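
The exact format of the Funding strings is not shown in this text. The sketch below assumes values such as "$630K", "$1.5M" or "$1B" (a dollar sign, a number and an optional K/M/B suffix) and shows one possible conversion to a number; the helper `funding_to_usd` and the sample strings are hypothetical.

```python
# Multipliers for the suffixes assumed to appear in the Funding strings.
SUFFIX_MULTIPLIER = {"K": 1e3, "M": 1e6, "B": 1e9}

def funding_to_usd(raw):
    """Convert a string such as '$1.5M' to a float amount in USD."""
    text = raw.strip().lstrip("$").replace(",", "")
    suffix = text[-1].upper()
    if suffix in SUFFIX_MULTIPLIER:
        return float(text[:-1]) * SUFFIX_MULTIPLIER[suffix]
    return float(text)  # no suffix: already a plain amount

# Hypothetical sample values, converted to millions of USD.
samples = ["$630K", "$1.5M", "$1B"]
funds_in_million = [funding_to_usd(s) / 1e6 for s in samples]
print(funds_in_million)
```

The helper expects a non-empty string, so null entries would need to be dropped or handled before it is applied.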

  1. Data exploration:
    • Check for null values in the attributes.

There are a total of 210 + 4 + 2 = 216 records with nulls or NaNs.

  1. Data preprocessing & visualisation:
    • Drop the null values.
  1. Data preprocessing & visualisation:
    • Convert the ‘Funding’ features to a numerical value.
  1. Data preprocessing & visualisation:
    • Plot box plot for funds in million.

The above graph indicates heavy skewness in the data, also flagging as many as 60 Funding values as outliers.
But compared with the sample size of just 446, the outliers account for 13.45% of the records.
Labelling more than 10% of the available sample as outliers and excluding them from further analysis would greatly influence the sample data distribution.
Hence let us try transforming the Funding data to obtain better clarity.

Experiment further
try to rescale the Funding information

  1. Data preprocessing & visualisation:
    • Get the lower fence from the box plot.
  1. Data preprocessing & visualisation:
    • Check number of outliers greater than upper fence.
  1. Data preprocessing & visualisation:
    • Drop the values that are greater than upper fence.
  1. Data preprocessing & visualisation:
    • Plot the box plot after dropping the values.
  1. Data preprocessing & visualisation:
    • Check frequency of the OperatingState features classes.
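
The fence steps above can be sketched with the usual box-plot rule (fences at 1.5 times the interquartile range beyond the quartiles). This is a minimal illustration on hypothetical funding values in millions; `statistics.quantiles` with `method="inclusive"` is used here as an assumption about how the quartiles were computed.

```python
from statistics import quantiles

# Hypothetical funding amounts in millions of USD.
funds = [0.5, 1, 1.5, 2, 3, 4, 5, 8, 50]

# Quartiles with linear interpolation over the sample itself.
q1, _median, q3 = quantiles(funds, n=4, method="inclusive")
iqr = q3 - q1

# Box-plot fences: 1.5 * IQR beyond the first and third quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values above the upper fence are flagged, the rest are kept.
outliers = [f for f in funds if f > upper_fence]
kept = [f for f in funds if f <= upper_fence]

print(f"lower fence = {lower_fence}, upper fence = {upper_fence}")
print(f"{len(outliers)} outlier(s) above the upper fence: {outliers}")
```

With right-skewed, non-negative data such as funding, the lower fence is often negative, so in practice only the upper fence removes any records.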

Out of 443 companies under study, 386 are functional & 57 have been closed.

  1. Data preprocessing & visualisation:
    • Plot a distribution plot for Funds in million.

Notably, the log transform has resulted in an approximately normal distribution.

  1. Data preprocessing & visualisation:
    • Plot distribution plots for companies still operating and companies that closed.
  1. Statistical analysis:
    • Is there any significant difference between Funds raised by companies that are still operating vs companies that closed down?
    • Write the null hypothesis and alternative hypothesis.
    • Test for significance and conclusion

Based on the above plots:

The above description suggests that the mean and spread of Funding vary significantly between Operating and Closed companies.
But the influence of skewness could cast ambiguity over the inference.
Hence let's review the same in the log-transformed data.

The previous inference is also supported by the log-transformed data.
Funds allocated to Closed companies were far less than those allocated to successfully operating companies.

Let us also verify the same using a 2-sample t-test.
Null Hypothesis Ho
Funds allocated to either classification of companies are similar
Alternate Hypothesis Ha
Funds allocated to Operating companies differ significantly from those allocated to Closed companies
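
The test itself is not shown in this text. The following is a minimal sketch of a two-sample t statistic on hypothetical data, using the Welch variant (which does not assume equal variances); the original analysis may have used the pooled-variance variant instead. The p-value here uses a normal approximation, which is only reasonable for large samples; a library routine such as `scipy.stats.ttest_ind` would give the t-based value.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic and its approximate degrees of freedom."""
    se2_a = variance(sample_a) / len(sample_a)  # squared standard error of mean A
    se2_b = variance(sample_b) / len(sample_b)  # squared standard error of mean B
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a**2 / (len(sample_a) - 1) + se2_b**2 / (len(sample_b) - 1)
    )
    return t, df

# Hypothetical (log-transformed) funding values for the two groups.
operating = [2.0, 3.0, 4.0, 5.0, 6.0]
closed = [1.0, 2.0, 3.0]

t_stat, dof = welch_t(operating, closed)

# Two-sided p-value under a normal approximation (assumption, see above).
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.4f}, df = {dof:.2f}, approximate p = {p_approx:.4f}")
```

The null hypothesis is rejected when the p-value falls below the chosen significance level, commonly 0.05.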

Conclusion: The above test reiterates that the funds allocated are not similar

  1. Statistical analysis:
    • Make a copy of the original data frame.
  1. Statistical analysis:
    • Check frequency distribution of Result variable.
  1. Statistical analysis:
    • Calculate percentage of winners that are still operating and percentage of contestants that are still operating

Considering all recognised companies as winners, for the sake of analysis

  1. Statistical analysis:
    • Write your hypothesis comparing the proportion of companies that are operating between winners and contestants:
    • Write the null hypothesis and alternative hypothesis.
    • Test for significance and conclusion

Z-test of proportions
Null Hypothesis Ho
Proportions of operating companies among Winner Companies & Contestant Companies are similar
Alternate Hypothesis Ha
Proportions of operating companies among Winner Companies & Contestant Companies are significantly different
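
A minimal sketch of a two-sided z-test for two proportions with a pooled standard error. The counts below are hypothetical and are not the figures from the dataset; the function name `two_proportion_z` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z statistic and its two-sided p-value."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: operating companies among winners and among contestants.
z_stat, p_value = two_proportion_z(success_a=90, n_a=100, success_b=240, n_b=300)

print(f"z = {z_stat:.4f}, p = {p_value:.4f}")
```

A p-value below the chosen significance level would reject Ho in favour of Ha.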

Conclusion: companies recognised in the Startup Battlefield event have survived better than the remaining contestants

  1. Statistical analysis:
    • Check distribution of the Event variable.

TC50 2008 & 2009 have seen the maximum number of contestants

  1. Statistical analysis:
    • Select only the Event that has disrupt keyword from 2013 onwards.
  1. Statistical analysis:
    • Write and perform your hypothesis along with significance test comparing the funds raised by companies across NY, SF and EU events from 2013 onwards.

One-way test
Null Hypothesis Ho
Funds raised across the 3 cities are the same
Alternate Hypothesis Ha
Funds raised across the 3 cities are different
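
Assuming the one-way test refers to a one-way ANOVA, its F statistic compares the variation between group means with the variation within groups. The sketch below computes F on hypothetical values for the three cities; turning F into a p-value needs the F distribution, which a routine such as `scipy.stats.f_oneway` provides.

```python
from statistics import mean

def one_way_f(*groups):
    """One-way ANOVA F statistic for two or more groups of observations."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical funding values (in millions) for the three city events.
ny = [1, 2, 3]
sf = [2, 3, 4]
eu = [3, 4, 5]

f_stat = one_way_f(ny, sf, eu)
print(f"F = {f_stat:.4f}")
```

An F value close to 1 means the group means differ about as much as expected from the within-group noise, which is consistent with not rejecting Ho.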

Hence the distributions of funds across the 3 cities are similar.

  1. Statistical analysis:
    • Plot the distribution plot comparing the 3 city events.

Suggestions: provide details on the reasons for funding, to allow better filtering.